1. Frame

A Portuguese bank wants to run a direct marketing campaign to sell its new term deposit plan. The goal is to help the bank identify the customers who are most likely to buy the plan.

Open Discussion: How would you approach this problem?

2. Acquire

The UCI Machine Learning Repository hosts many datasets for machine learning. We will use its Bank Marketing dataset; see https://archive.ics.uci.edu/ml/datasets/Bank+Marketing for more information.

Load the train and test datasets


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)

In [3]:
# Load the train dataset
train = pd.read_csv("../Data/train.csv")

In [4]:
# Load the test dataset
test = pd.read_csv("../Data/test.csv")

In [5]:
# View the first 5 records of train
train.head()


Out[5]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

In [6]:
# View the last 10 records of test
test.tail(10)


Out[6]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
9990 62 management divorced tertiary no 5943 no no telephone 17 feb 196 4 -1 0 unknown yes
9991 38 technician single tertiary no 25 yes no cellular 1 jun 232 2 -1 0 unknown yes
9992 25 management single tertiary no 316 no no cellular 27 mar 347 2 -1 0 unknown yes
9993 43 technician divorced unknown no 4389 no no cellular 8 apr 618 1 -1 0 unknown yes
9994 45 admin. divorced secondary no 0 no no cellular 29 oct 264 1 -1 0 unknown yes
9995 78 retired divorced primary no 1389 no no cellular 8 apr 335 1 -1 0 unknown yes
9996 30 management single tertiary no 398 no no cellular 27 oct 102 1 180 3 success yes
9997 69 retired divorced tertiary no 247 no no cellular 22 apr 138 2 -1 0 unknown yes
9998 48 entrepreneur married secondary no 0 no yes cellular 28 jul 431 2 -1 0 unknown yes
9999 31 admin. single secondary no 131 yes no cellular 15 jun 151 1 -1 0 unknown yes

In [7]:
# List the attributes/feature names/columns in train dataset
train.columns


Out[7]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'],
      dtype='object')

In [8]:
# List the attributes in test dataset. 
test.columns


Out[8]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'],
      dtype='object')

In [9]:
type(test.columns)


Out[9]:
pandas.indexes.base.Index

In [10]:
train.columns.values


Out[10]:
array(['age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'deposit'], dtype=object)

In [11]:
test.columns.values


Out[11]:
array(['age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'deposit'], dtype=object)

In [12]:
# Do the test columns match those of train?
[x in test.columns.values for x in train.columns.values]


Out[12]:
[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True]

Attribute Information:

Input variables:

bank client data:

  1. age (numeric)
  2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
  3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
  4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
  5. default: has credit in default? (categorical: 'no','yes','unknown')
  6. housing: has housing loan? (categorical: 'no','yes','unknown')
  7. loan: has personal loan? (categorical: 'no','yes','unknown')
related with the last contact of the current campaign:

  8. contact: contact communication type (categorical: 'cellular','telephone')
  9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

social and economic context attributes:

  16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17. cons.price.idx: consumer price index - monthly indicator (numeric)
  18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  19. euribor3m: euribor 3 month rate - daily indicator (numeric)
  20. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

  21. y: has the client subscribed to a term deposit? (binary: 'yes','no')

Note: this description comes from the newer "bank-additional" release of the dataset. The train/test files used here follow the older schema visible above: education takes 'primary'/'secondary'/'tertiary'/'unknown', the last contact day is a numeric day of month ('day'), pdays uses -1 (not 999) for "not previously contacted", poutcome also takes the values 'other' and 'unknown', the social and economic context attributes are absent, and the target column is named 'deposit'.

3. Explore


In [13]:
train.dtypes


Out[13]:
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object

In [14]:
# Find unique values in deposit for train dataset
pd.unique(train.deposit)


Out[14]:
array(['no', 'yes'], dtype=object)

In [15]:
# Find unique values in deposit for test dataset. Are they the same? 
pd.unique(test.deposit)


Out[15]:
array(['no', 'yes'], dtype=object)

In [16]:
pd.unique(test['month'])


Out[16]:
array(['may', 'apr', 'jun', 'jul', 'aug', 'feb', 'nov', 'jan', 'mar',
       'sep', 'oct', 'dec'], dtype=object)

In [17]:
# Find frequency of deposit in train dataset
train.deposit.value_counts()


Out[17]:
no     31092
yes     4119
Name: deposit, dtype: int64

In [18]:
# Find frequency of deposit in test dataset
test.deposit.value_counts()


Out[18]:
no     8830
yes    1170
Name: deposit, dtype: int64

In [19]:
type(train.deposit.value_counts())


Out[19]:
pandas.core.series.Series

In [20]:
# Is the distribution of deposit similar in train and test?
print("train:",train.deposit.value_counts()[1]/train.shape[0]*100)
print("test:",test.deposit.value_counts()[1]/test.shape[0]*100)


train: 11.6980489052
test: 11.7

In [21]:
# Find number of rows and columns in train 
train.shape


Out[21]:
(35211, 17)

In [22]:
# Find number of rows and columns in test
test.shape


Out[22]:
(10000, 17)

Find basic summary metrics for the train dataframe


In [23]:
train.describe()


Out[23]:
age balance day duration campaign pdays previous
count 35211.000000 35211.000000 35211.000000 35211.000000 35211.000000 35211.000000 35211.000000
mean 40.965153 1355.947914 15.802221 258.191048 2.759337 40.104087 0.582659
std 10.651197 3060.839946 8.339288 257.335241 3.098252 100.220917 2.418828
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 71.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 447.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1418.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

Where did the remaining columns go?

describe() summarizes only the numeric columns by default; categorical (object) columns are excluded.
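
To get the analogous summary for the categorical columns (count, number of unique values, most frequent value and its frequency), describe can be pointed at object dtypes - a quick sketch, output not shown:

# Summarize the categorical (object) columns
train.describe(include=['object'])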

Plots


In [24]:
# Create labels: 0 for 'no' and 1 for 'yes' in the train dataset
labels = np.where(train.deposit=="no", 0, 1)

In [25]:
# Display the counts of 0 and 1 - do they match the value_counts above?
np.unique(labels, return_counts=True)


Out[25]:
(array([0, 1]), array([31092,  4119]))

Bivariate plot: Deposit vs age


In [26]:
train.head()


Out[26]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

In [27]:
train.loc[:,['deposit','age']]


Out[27]:
deposit age
0 no 58
1 no 44
2 no 33
3 no 47
4 no 33
5 no 28
6 no 42
7 no 58
8 no 43
9 no 41
10 no 29
11 no 53
12 no 57
13 no 45
14 no 57
15 no 60
16 no 33
17 no 28
18 no 32
19 no 25
20 no 44
21 no 39
22 no 52
23 no 36
24 no 57
25 no 49
26 no 60
27 no 59
28 no 51
29 no 57
... ... ...
35181 no 36
35182 yes 62
35183 yes 38
35184 yes 36
35185 yes 34
35186 no 66
35187 no 46
35188 no 63
35189 yes 60
35190 no 59
35191 yes 32
35192 yes 29
35193 no 25
35194 yes 32
35195 yes 75
35196 yes 29
35197 yes 68
35198 yes 25
35199 yes 36
35200 no 34
35201 yes 38
35202 yes 53
35203 yes 34
35204 yes 23
35205 yes 73
35206 yes 25
35207 yes 51
35208 yes 72
35209 no 57
35210 no 37

35211 rows × 2 columns


In [28]:
bivariate_plot_deposit_age = train.loc[:,["deposit", "age"]].copy()

In [29]:
bivariate_plot_deposit_age.head()


Out[29]:
deposit age
0 no 58
1 no 44
2 no 33
3 no 47
4 no 33

In [30]:
bivariate_plot_deposit_age.age.hist()


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1100f49b0>

In [31]:
sns.stripplot(x="deposit", y = "age", data = bivariate_plot_deposit_age,
             jitter = True, alpha = 0.1)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1092d6e48>

Multivariate plot: Deposit vs age and pdays


In [32]:
train.plot(kind="scatter", x = 'age', y = 'pdays', color = labels, alpha = 0.5, s=50)


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1103ff048>

Multivariate plot: Deposit vs day and duration


In [33]:
train.plot(kind="scatter", x = 'day', y = 'duration', color = labels, 
           alpha = 0.5, s=50)


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1109624a8>

4. Refine

Convert categorical variables to numeric.

Two options:

  1. Label Encoder
  2. One-Hot Encoding
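
As a quick illustration of the difference, here is a minimal sketch on a made-up toy column:

s = pd.Series(['red', 'green', 'blue', 'green'])

# Label encoding: each category becomes a single integer code
s.astype('category').cat.codes   # -> 2, 1, 0, 1

# One-hot encoding: one 0/1 indicator column per category
pd.get_dummies(s)                # -> columns: blue, green, red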

Label Encoding


In [34]:
import sklearn
from sklearn import preprocessing

In [35]:
# Find the columns that are categorical
train.select_dtypes(include=['object'])


Out[35]:
job marital education default housing loan contact month poutcome deposit
0 management married tertiary no yes no unknown may unknown no
1 technician single secondary no yes no unknown may unknown no
2 entrepreneur married secondary no yes yes unknown may unknown no
3 blue-collar married unknown no yes no unknown may unknown no
4 unknown single unknown no no no unknown may unknown no
5 management single tertiary no yes yes unknown may unknown no
6 entrepreneur divorced tertiary yes yes no unknown may unknown no
7 retired married primary no yes no unknown may unknown no
8 technician single secondary no yes no unknown may unknown no
9 admin. divorced secondary no yes no unknown may unknown no
10 admin. single secondary no yes no unknown may unknown no
11 technician married secondary no yes no unknown may unknown no
12 services married secondary no yes no unknown may unknown no
13 admin. single unknown no yes no unknown may unknown no
14 blue-collar married primary no yes no unknown may unknown no
15 retired married primary no yes no unknown may unknown no
16 services married secondary no yes no unknown may unknown no
17 blue-collar married secondary no yes yes unknown may unknown no
18 blue-collar single primary no yes yes unknown may unknown no
19 services married secondary no yes no unknown may unknown no
20 admin. married secondary no yes no unknown may unknown no
21 management single tertiary no yes no unknown may unknown no
22 entrepreneur married secondary no yes yes unknown may unknown no
23 technician single secondary no yes yes unknown may unknown no
24 technician married secondary no no yes unknown may unknown no
25 management married tertiary no yes no unknown may unknown no
26 admin. married secondary no yes yes unknown may unknown no
27 blue-collar married secondary no yes no unknown may unknown no
28 management married tertiary no yes no unknown may unknown no
29 technician divorced secondary no yes no unknown may unknown no
... ... ... ... ... ... ... ... ... ... ...
35181 admin. single tertiary no no no cellular nov failure no
35182 blue-collar married secondary no no no cellular nov success yes
35183 entrepreneur single secondary no no no cellular nov success yes
35184 admin. divorced secondary no yes no cellular nov success yes
35185 blue-collar married secondary no yes no cellular nov success yes
35186 retired married secondary no no no cellular nov failure no
35187 blue-collar married secondary no no no cellular nov failure no
35188 retired married secondary no no no cellular nov success no
35189 services married tertiary no yes no cellular nov success yes
35190 unknown married unknown no no no cellular nov failure no
35191 services single secondary no yes no cellular nov unknown yes
35192 management single secondary no yes no cellular nov success yes
35193 services single secondary no no no cellular nov failure no
35194 blue-collar married secondary no no no cellular nov success yes
35195 retired divorced tertiary no yes no cellular nov failure yes
35196 management single tertiary no no no cellular nov unknown yes
35197 retired married secondary no no no cellular nov success yes
35198 student single secondary no no no cellular nov unknown yes
35199 management single secondary no yes no cellular nov unknown yes
35200 blue-collar single secondary no yes no cellular nov other no
35201 technician married secondary no yes no cellular nov unknown yes
35202 management married tertiary no no no cellular nov success yes
35203 admin. single secondary no no no cellular nov unknown yes
35204 student single tertiary no no no cellular nov unknown yes
35205 retired married secondary no no no cellular nov failure yes
35206 technician single secondary no no yes cellular nov unknown yes
35207 technician married tertiary no no no cellular nov unknown yes
35208 retired married secondary no no no cellular nov success yes
35209 blue-collar married secondary no no no telephone nov unknown no
35210 entrepreneur married secondary no no no cellular nov other no

35211 rows × 10 columns


In [36]:
train_to_convert = train.select_dtypes(include=["object_"]).copy()
test_to_convert = test.select_dtypes(include=["object_"]).copy()

In [37]:
train_np = np.array(train_to_convert)
test_np = np.array(test_to_convert)

In [38]:
# Fit one LabelEncoder per categorical column on the train values,
# then apply the same integer mapping to both train and test
for i in range(train_np.shape[1]):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train_np[:, i]))
    train_np[:, i] = lbl.transform(train_np[:, i])
    test_np[:, i] = lbl.transform(test_np[:, i])
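
One caveat: the encoders above see only the train values, so lbl.transform(test_np[:, i]) would raise a ValueError if test contained a category absent from train (it happens not to here). A common defensive variant - a sketch, not what was run in this notebook - fits each encoder on the union of train and test values:

for i in range(train_np.shape[1]):
    lbl = preprocessing.LabelEncoder()
    # Fit on every value seen in either dataset
    lbl.fit(list(train_np[:, i]) + list(test_np[:, i]))
    train_np[:, i] = lbl.transform(train_np[:, i])
    test_np[:, i] = lbl.transform(test_np[:, i])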

In [39]:
# Display train_np
train_np


Out[39]:
array([[4, 1, 2, ..., 8, 3, 0],
       [9, 2, 1, ..., 8, 3, 0],
       [2, 1, 1, ..., 8, 3, 0],
       ..., 
       [5, 1, 1, ..., 9, 2, 1],
       [1, 1, 1, ..., 9, 3, 0],
       [2, 1, 1, ..., 9, 1, 0]], dtype=object)

In [40]:
# How would you transform test? (It was transformed with the train-fitted encoders in the loop above.)
test_np


Out[40]:
array([[6, 2, 1, ..., 8, 3, 0],
       [1, 1, 0, ..., 0, 3, 0],
       [5, 1, 1, ..., 6, 3, 0],
       ..., 
       [5, 0, 2, ..., 0, 3, 1],
       [2, 1, 1, ..., 5, 3, 1],
       [0, 2, 1, ..., 6, 3, 1]], dtype=object)


In [43]:
# Now, merge the numeric and encoded train variables into one single dataset

In [44]:
train_numeric = np.array(train.select_dtypes(exclude=["object_"]).copy())

In [45]:
train_numeric.shape


Out[45]:
(35211, 7)

In [46]:
train_encoded = np.concatenate([train_numeric, train_np], axis=1)

In [47]:
# Now, merge the numeric and encoded test variables into one single dataset
test_numeric = np.array(test.select_dtypes(exclude=["object_"]).copy())

In [48]:
test_encoded = np.concatenate([test_numeric, test_np], axis=1)

5. Model


In [49]:
# Create train X and train Y

In [50]:
xlen = train_encoded.shape[1]-1

In [51]:
train_encoded_X = train_encoded[:, :xlen]

In [52]:
train_encoded_Y = np.array(train_encoded[:, -1], dtype=float)

In [53]:
train_encoded_Y


Out[53]:
array([ 0.,  0.,  0., ...,  1.,  0.,  0.])

In [54]:
# Create test X
test_encoded_X = test_encoded[:, :xlen]

In [55]:
# Create test Y
test_encoded_Y = np.array(test_encoded[:, -1], dtype=float)

Benchmark Model

With 0/1 labels, the mean squared error of 0/1 predictions is just the misclassification rate, so an all-zero ("always no") model errs on exactly the 11.7% of positive records.


In [56]:
# Benchmark: predict 0 ("no") for every test record
model_allzero = np.zeros_like(test_encoded_Y)

In [58]:
# The mean square error on AllZero model
print("Mean Squared Error on all zero model: %.2f"
      % (np.mean((model_allzero - test_encoded_Y) ** 2)*100))


Mean Squared Error on all zero model: 11.70
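
scikit-learn also ships a ready-made baseline estimator for exactly this purpose; a sketch using DummyClassifier (not run in this notebook):

from sklearn.dummy import DummyClassifier

# Always predicts the majority class ('no', i.e. 0)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(train_encoded_X, train_encoded_Y)
print(baseline.score(test_encoded_X, test_encoded_Y))  # accuracy of the always-no baseline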

First Model: Linear Regression

Y = β0 + β1·X1 + β2·X2 + … + βn·Xn


In [59]:
from sklearn import linear_model

In [60]:
model_linear = linear_model.LinearRegression()

In [61]:
model_linear.fit(train_encoded_X, train_encoded_Y)


Out[61]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [62]:
# The coefficients
print('Coefficients: \n', model_linear.coef_)


Coefficients: 
 [  1.04646483e-03   2.14190275e-06  -4.43191769e-04   4.82999672e-04
  -3.00268937e-03   4.58783750e-04   7.49324089e-03   1.01976457e-03
   2.07698879e-02   1.55435599e-02  -1.97712387e-02  -8.52556895e-02
  -4.47500171e-02  -3.76204027e-02   4.77309366e-03   2.80968562e-02]

In [63]:
# Prediction
model_linear_prediction = model_linear.predict(test_encoded_X)

In [64]:
model_linear_prediction


Out[64]:
array([ 0.05588253,  0.144221  ,  0.04835766, ...,  0.11550334,
        0.21314076,  0.05215351])

In [65]:
# Threshold the continuous predictions at 0.5 to get 0/1 class labels
# (note: the MSEs printed below are computed on the raw continuous predictions)
model_linear_prediction = np.where(model_linear_prediction > 0.5, 1, 0)

In [66]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_linear.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 8.14

In [67]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_linear.predict(test_encoded_X) - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 8.17

Second Model: L2 Logistic Regression


In [68]:
model_logistic_L2 = linear_model.LogisticRegression()

In [69]:
model_logistic_L2.fit(train_encoded_X, train_encoded_Y)


Out[69]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [70]:
# The coefficients
print('Coefficients: \n', model_logistic_L2.coef_)


Coefficients: 
 [[  6.82530131e-03   1.79331639e-05  -7.18550598e-03   3.87348869e-03
   -1.42900945e-01   3.30391815e-03   8.55022031e-02   7.36735209e-03
    1.79800068e-01   1.42854940e-01  -3.02376976e-01  -9.78798877e-01
   -7.33822319e-01  -6.24494186e-01   2.91691916e-02   1.82317196e-01]]

In [71]:
# Prediction
model_logistic_L2_prediction = model_logistic_L2.predict(test_encoded_X)

In [72]:
np.unique(model_logistic_L2_prediction)


Out[72]:
array([ 0.,  1.])

In [73]:
np.sum(model_logistic_L2_prediction)


Out[73]:
401.0

In [74]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_logistic_L2.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 10.86

In [75]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_logistic_L2_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 10.97

Third Model: L1 Logistic Regression


In [76]:
# Code here. Report the evaluation.
model_logistic_L1 = linear_model.LogisticRegression(penalty = 'l1')

In [77]:
model_logistic_L1.fit(train_encoded_X, train_encoded_Y)


Out[77]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [78]:
# The coefficients
print('Coefficients: \n', model_logistic_L1.coef_)


Coefficients: 
 [[  1.05637731e-02   1.68072330e-05  -5.62844791e-03   3.93893768e-03
   -1.38350472e-01   3.65055334e-03   8.73469701e-02   8.46801335e-03
    2.39397964e-01   1.83525906e-01  -3.16044934e-01  -1.04517588e+00
   -7.22688905e-01  -6.21400423e-01   3.55617478e-02   2.14720323e-01]]

In [79]:
# Prediction
model_logistic_L1_prediction = model_logistic_L1.predict(test_encoded_X)

In [80]:
np.unique(model_logistic_L1_prediction)


Out[80]:
array([ 0.,  1.])

In [81]:
np.sum(model_logistic_L1_prediction)


Out[81]:
419.0

In [82]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_logistic_L1.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 10.81

In [83]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_logistic_L1_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 11.07

Fourth Model: L2 Logistic Regression - Change the value of C


In [84]:
model_logistic_L2C = linear_model.LogisticRegression(C = 2)

In [85]:
model_logistic_L2C.fit(train_encoded_X, train_encoded_Y)


Out[85]:
LogisticRegression(C=2, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [86]:
# The coefficients
print('Coefficients: \n', model_logistic_L2C.coef_)


Coefficients: 
 [[  8.39493786e-03   1.73990030e-05  -6.34146842e-03   3.89666637e-03
   -1.41255272e-01   3.44310785e-03   8.63377831e-02   8.26141345e-03
    1.85344786e-01   1.81572594e-01  -3.18528700e-01  -9.88455139e-01
   -7.29451853e-01  -6.10667371e-01   3.12251656e-02   1.94457931e-01]]

In [87]:
# Prediction
model_logistic_L2C_prediction = model_logistic_L2C.predict(test_encoded_X)

In [88]:
np.unique(model_logistic_L2C_prediction)


Out[88]:
array([ 0.,  1.])

In [89]:
np.sum(model_logistic_L2C_prediction)


Out[89]:
401.0

In [90]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_logistic_L2C.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 10.84

In [91]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_logistic_L2C_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 10.99

Fifth Model: Decision Tree Model


In [107]:
from sklearn import tree
from sklearn.externals.six import StringIO
# import pydot

In [108]:
model_DT = tree.DecisionTreeClassifier()

In [109]:
# Let's use only two of the features (columns 1 and 2) for the model

In [110]:
model_DT.fit(train_encoded_X[:,1:3], train_encoded_Y)


Out[110]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [111]:
# dot_data = StringIO() 
# tree.export_graphviz(model_DT, out_file=dot_data) 
# graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
# graph.write_pdf("dt1.pdf")

In [98]:
# Prediction
model_DT_prediction = model_DT.predict(test_encoded_X[:,1:3])

In [99]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_DT.predict(train_encoded_X[:,1:3]) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 3.65

In [100]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_DT_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 17.75

Decision trees are prone to overfitting!
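
One standard mitigation is to cap the tree's depth so it cannot memorize the training data; a sketch (the depth value of 5 is purely illustrative):

# A shallower tree trades training fit for better generalization
model_DT_shallow = tree.DecisionTreeClassifier(max_depth=5)
model_DT_shallow.fit(train_encoded_X[:, 1:3], train_encoded_Y)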

Now, use all the features to build the model, and report the accuracy.


In [113]:
model_DTAll = tree.DecisionTreeClassifier()

In [117]:
model_DTAll.fit(train_encoded_X, train_encoded_Y)


Out[117]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [118]:
# Prediction
model_DTAll_prediction = model_DTAll.predict(test_encoded_X)

In [119]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_DTAll.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 0.00

In [120]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_DTAll_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 12.45

Sixth Model: Random Forest Model


In [101]:
from sklearn.ensemble import RandomForestClassifier

In [ ]:
?RandomForestClassifier

In [102]:
model_RF = RandomForestClassifier()

In [103]:
model_RF.fit(train_encoded_X, train_encoded_Y)


Out[103]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [104]:
# Prediction
model_RF_prediction = model_RF.predict(test_encoded_X)

In [105]:
# The mean percentage error (MSE × 100 on 0/1 labels) on train
print("Mean Percentage Error on train: %.2f"
      % (np.mean((model_RF.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Percentage Error on train: 0.77

In [106]:
# The mean percentage error (MSE × 100 on 0/1 labels) on test
print("Mean Percentage Error on test: %.2f"
      % (np.mean((model_RF_prediction - test_encoded_Y) ** 2)*100))


Mean Percentage Error on test: 10.14

Let's change the model parameters:

  • Use 400 trees
  • Use a maximum depth of 8
  • Print the Out-of-Bag (OOB) score

In [121]:
?RandomForestClassifier

In [122]:
model_RFMod = RandomForestClassifier(max_depth = 8, oob_score = True, n_estimators = 400 )

In [123]:
model_RFMod.fit(train_encoded_X, train_encoded_Y)


Out[123]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
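
Since oob_score=True was set, the out-of-bag accuracy estimate can be read directly off the fitted model (value not shown here):

# Accuracy estimated on the samples each tree never saw during bagging
print("OOB score:", model_RFMod.oob_score_)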

In [124]:
# Prediction
model_RFMod_prediction = model_RFMod.predict(test_encoded_X)

In [125]:
# The mean percentage error (MSE × 100 on 0/1 labels) on train
print("Mean Percentage Error on train: %.2f"
      % (np.mean((model_RFMod.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Percentage Error on train: 8.29

In [126]:
# The mean percentage error (MSE × 100 on 0/1 labels) on test
print("Mean Percentage Error on test: %.2f"
      % (np.mean((model_RFMod_prediction - test_encoded_Y) ** 2)*100))


Mean Percentage Error on test: 9.92

Cross Validation


In [127]:
from sklearn.cross_validation import StratifiedKFold

In [ ]:
?StratifiedKFold

In [128]:
skf = StratifiedKFold(train_encoded_Y, 5, random_state=1131, shuffle=True)

In [129]:
# Note: these are row-index arrays; avoid naming them `train`/`test`,
# which would shadow the train/test DataFrames loaded earlier
for train_idx, test_idx in skf:
    print("%s %s" % (train_idx, test_idx))
    print(train_idx.shape, test_idx.shape)


[    0     1     2 ..., 35207 35208 35210] [   10    17    23 ..., 35197 35201 35209]
(28168,) (7043,)
[    0     1     2 ..., 35206 35209 35210] [    4     6    11 ..., 35190 35207 35208]
(28168,) (7043,)
[    0     1     3 ..., 35207 35208 35209] [    2     8    16 ..., 35192 35199 35210]
(28169,) (7042,)
[    2     3     4 ..., 35208 35209 35210] [    0     1     7 ..., 35203 35204 35206]
(28169,) (7042,)
[    0     1     2 ..., 35208 35209 35210] [    3     5     9 ..., 35198 35202 35205]
(28170,) (7041,)

In [130]:
model_RF = RandomForestClassifier()

In [131]:
for k, (train_idx, test_idx) in enumerate(skf):
    model_RF.fit(train_encoded_X[train_idx], train_encoded_Y[train_idx])
    print("fold:", k+1, model_RF.score(train_encoded_X[test_idx], train_encoded_Y[test_idx]))


fold: 1 0.897486866392
fold: 2 0.901320460031
fold: 3 0.903578528827
fold: 4 0.898324339676
fold: 5 0.898593949723
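
The same per-fold scores can be computed in one call with cross_val_score; a sketch (in this scikit-learn version it lives in sklearn.cross_validation, later moved to sklearn.model_selection):

from sklearn.cross_validation import cross_val_score

# Stratified 5-fold accuracy for a fresh random forest
scores = cross_val_score(RandomForestClassifier(), train_encoded_X, train_encoded_Y, cv=5)
print(scores.mean())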

Find mean CV error


In [158]:
cv_error = []
for k, (train_idx, test_idx) in enumerate(skf):
    model_RF.fit(train_encoded_X[train_idx], train_encoded_Y[train_idx])
    print(k)
    # Score on the held-out fold, not on the full training set
    cv_error.append(np.mean((model_RF.predict(train_encoded_X[test_idx])
                             - train_encoded_Y[test_idx]) ** 2) * 100)


0
1
2
3
4

In [159]:
cv_error


Out[159]:
[five per-fold error percentages, roughly 10 each given the fold accuracies above; exact values vary from run to run]

In [163]:
# Mean CV error across the five folds
np.mean(cv_error)


Out[163]:
[mean of the five per-fold errors]

Exercise: repeat this with different parameters, different models, and different numbers of folds, and compare the mean CV errors.


In [ ]:


One Hot Encoding


In [132]:
# Reload the raw train/test DataFrames for the one-hot approach
train = pd.read_csv("../Data/train.csv")
test = pd.read_csv("../Data/test.csv")

In [133]:
train_one_hot = pd.get_dummies(train)

In [134]:
train_one_hot.head()


Out[134]:
age balance day duration campaign pdays previous job_admin. job_blue-collar job_entrepreneur ... month_may month_nov month_oct month_sep poutcome_failure poutcome_other poutcome_success poutcome_unknown deposit_no deposit_yes
0 58 2143 5 261 1 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1 44 29 5 151 1 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
2 33 2 5 76 1 -1 0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
3 47 1506 5 92 1 -1 0 0.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 33 1 5 198 1 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0

5 rows × 53 columns


In [135]:
test_one_hot = pd.get_dummies(test)
test_one_hot.head()


Out[135]:
age balance day duration campaign pdays previous job_admin. job_blue-collar job_entrepreneur ... month_may month_nov month_oct month_sep poutcome_failure poutcome_other poutcome_success poutcome_unknown deposit_no deposit_yes
0 38 677 14 114 2 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1 58 5445 14 391 1 -1 0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
2 55 5 20 108 1 -1 0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
3 26 63 28 76 4 -1 0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 48 907 4 103 1 -1 0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0

5 rows × 53 columns


In [136]:
# Check if columns are the same

In [149]:
[x in test_one_hot.columns.values for x in train_one_hot.columns.values]


Out[149]:
[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True]
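
Here every train column exists in test, so the check passes. If the dummy columns ever diverged (a category appearing in only one of the files), DataFrame.align could reconcile them - a sketch:

# Keep train's column set; any column missing from test is added as all zeros
train_one_hot, test_one_hot = train_one_hot.align(test_one_hot, join='left',
                                                  axis=1, fill_value=0)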

In [137]:
#Create train X , train Y, test X , test Y

In [138]:
# Features: every column except the two deposit_* indicator columns
train_X = train_one_hot.iloc[:, :train_one_hot.shape[1]-2]

In [139]:
train_X.columns


Out[139]:
Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'],
      dtype='object')

In [140]:
# Target: the deposit_yes indicator (the last column)
train_Y = train_one_hot.iloc[:, -1]

In [141]:
train_Y.head()


Out[141]:
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: deposit_yes, dtype: float64

In [142]:
test_X = test_one_hot.iloc[:, :test_one_hot.shape[1]-2]
test_Y = test_one_hot.iloc[:, -1]

In [143]:
# Run Random Forest and check accuracy

In [144]:
model_RF = RandomForestClassifier(n_estimators=400, max_depth=8, oob_score=True, n_jobs=-1)

In [145]:
model_RF.fit(train_X, train_Y)


Out[145]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [146]:
# Prediction
model_RF_prediction = model_RF.predict(test_X)

In [147]:
# The mean percentage error (MSE × 100 on 0/1 labels) on train
print("Mean Percentage Error on train: %.2f"
      % (np.mean((model_RF.predict(train_X) - train_Y) ** 2)*100))


Mean Percentage Error on train: 9.87

In [148]:
# The mean percentage error (MSE × 100 on 0/1 labels) on test
print("Mean Percentage Error on test: %.2f"
      % (np.mean((model_RF_prediction - test_Y) ** 2)*100))


Mean Percentage Error on test: 10.54